Abstract

In compositional data, an observation is a vector with non-negative components which sum 
to a constant, typically 1. Data of this type arise in many areas, such as geology, archaeology, 
biology, economics and political science among others. The goal of this paper is to extend the 
taxicab metric and a newly suggested metric for compositional data by employing a power 
transformation. Both metrics are to be used in the k-nearest neighbours algorithm regardless
of the presence of zeros. Examples with real data are exhibited.

1 Introduction


Compositional data are non-negative multivariate data and each vector sums to the same
constant, usually 1 for convenience. Compositional data are met in many disciplines, including
geology (Aitchison, 1982), economics (Fry et al., 2000), archaeology (Baxter et al., 2005)
and political sciences (Rodrigues and Lima, 2009). Their sample space is called the simplex $S^d$
and in mathematical terms is

\[
S^d = \left\{ (x_1, \ldots, x_D)^T : x_i \geq 0, \ \sum_{i=1}^{D} x_i = 1 \right\},
\]

where D denotes the number of components and d = D − 1.

Ever since Aitchison (1982) suggested the use of the log-ratio transformation for compositional
data, most of the analyses of such data have been implemented using this transformation.
Aitchison (2003) implemented linear discriminant analysis for compositional data using
the log-ratio transformation. Over the years though, researchers have suggested alternative
ways for supervised classification of compositional data, see for example Gallo (2010) and
Neocleous et al. (2011).

An important issue in compositional data is the presence of zeros, which cause problems
for the logarithmic transformation. The issue of zero values in some components is not
addressed in most papers, but see Neocleous et al. (2011) for an example of discrimination
in the presence of zeros. One could use alternative models (see for example Scealy and Welsh,
2011a and Stewart and Field, 2011) or replace the zero values by making
parametric assumptions (Martin et al., 2012).

In this paper we suggest the use of a recently developed metric for classification of
compositional data when the k-nearest neighbours (k-NN) algorithm is implemented. It is
a metric for probability distributions (Endres and Schindelin, 2003; Österreicher and Vajda,
2003) which can be adapted to compositional data as well, since each vector sums to 1. The
second metric we suggest is the Manhattan metric, a scaled version of which has already been
used for compositional data analysis (Miller, 2002). We will extend both of these metrics
by applying a power transformation. We will see that both of these metrics handle zeros
naturally and hence they can be used regardless of zeros being present in some components.
This is a very attractive feature of these metrics, in contrast to the Aitchisonian metric
suggested by Aitchison (2003), which is not applicable when zeros are present in the data.
Examples using real data are used to illustrate the performance of these metrics.

Section 2 describes the two metrics, how they can be extended and also presents graphically
their loci of points equidistant from the centre of the simplex. Section 3 presents the k-NN
algorithm for compositional data, with examples using real data in Section 3.1. Finally,
Section 4 concludes this paper.


2 Metrics for compositional data 


We will present three metrics for compositional data, two of which have already been examined.
But first we will show the power transformation. Aitchison (2003) defined the power
transformation to be

\[
\mathbf{u} = \left( \frac{x_1^{\alpha}}{\sum_{j=1}^{D} x_j^{\alpha}}, \ldots, \frac{x_D^{\alpha}}{\sum_{j=1}^{D} x_j^{\alpha}} \right). \tag{1}
\]

The value of α will be determined by the estimated accuracy of the k-NN algorithm.
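
As a minimal sketch of the power transformation (1), the Python function below raises each component to the power α and rescales so that the result sums to 1. The name `power_transform` and the guard that rejects zeros when α ≤ 0 are assumptions made for illustration only; they are not part of the original definition.

```python
def power_transform(x, alpha):
    """Power transformation (1): u_i = x_i**alpha / sum_j x_j**alpha."""
    # Assumption: a zero component raised to a non-positive power is not
    # meaningful, so this sketch refuses that combination explicitly.
    if alpha <= 0 and any(xi == 0 for xi in x):
        raise ValueError("alpha must be positive when zeros are present")
    powered = [xi ** alpha for xi in x]
    total = sum(powered)
    return [p / total for p in powered]


x = [0.7, 0.2, 0.1]
print(power_transform(x, 1.0))  # the composition itself, up to rounding
print(power_transform(x, 0.5))  # pulled towards the centre of the simplex
print(power_transform(x, 0.0))  # the centre (1/3, 1/3, 1/3)
```

The transformed vector always sums to 1, so it is again a composition.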


2.1 The ES-OVα metric for compositional data

We advocate that as a measure of the distance between two compositions we can use the
square root of the Jensen-Shannon divergence

\[
\text{ES-OV}(\mathbf{x}, \mathbf{w}) = \left[ \sum_{i=1}^{D} \left( x_i \log \frac{2 x_i}{x_i + w_i} + w_i \log \frac{2 w_i}{x_i + w_i} \right) \right]^{1/2}, \tag{2}
\]

where $\mathbf{x}, \mathbf{w} \in S^d$.

Endres and Schindelin (2003) and Österreicher and Vajda (2003) proved, independently,
that (2) satisfies the triangle inequality and thus it is a metric. For this reason we will refer
to it as the ES-OV metric.
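
The following Python sketch evaluates the ES-OV metric (2) for two compositions. The function name `es_ov` is hypothetical, and the convention 0 log 0 = 0 is applied explicitly so that zero components contribute nothing to the sum.

```python
import math


def es_ov(x, w):
    """ES-OV metric (2) between two compositions of equal length."""
    def term(a, b):
        # a * log(2a / (a + b)), using the convention 0 * log 0 = 0.
        return 0.0 if a == 0 else a * math.log(2 * a / (a + b))

    total = sum(term(xi, wi) + term(wi, xi) for xi, wi in zip(x, w))
    # Guard against a tiny negative total caused by floating-point rounding.
    return math.sqrt(max(total, 0.0))


x = [0.5, 0.5, 0.0]
w = [0.0, 0.5, 0.5]
print(es_ov(x, w))  # zeros are handled without any replacement
print(es_ov(x, x))  # 0.0
```

For this pair the sum reduces to 0.5 log 2 + 0.5 log 2 = log 2, so the distance is the square root of log 2.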

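We will use the power transformation (1) to define a more general metric termed the ES-OVα
metric

\[
\text{ES-OV}_{\alpha}(\mathbf{x}, \mathbf{w}) = \left[ \sum_{i=1}^{D} \left( u_i \log \frac{2 u_i}{u_i + v_i} + v_i \log \frac{2 v_i}{u_i + v_i} \right) \right]^{1/2}, \quad u_i = \frac{x_i^{\alpha}}{\sum_{j=1}^{D} x_j^{\alpha}}, \quad v_i = \frac{w_i^{\alpha}}{\sum_{j=1}^{D} w_j^{\alpha}}. \tag{3}
\]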



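2.2 The taxicabα metric for compositional data

The taxicab metric is also known as the L1 (or Manhattan) metric and is defined as

\[
\text{TC}(\mathbf{x}, \mathbf{w}) = \sum_{i=1}^{D} \left| x_i - w_i \right|. \tag{4}
\]

We will again employ the power transformation (1) to define a more general metric which
we will term the TCα metric

\[
\text{TC}_{\alpha}(\mathbf{x}, \mathbf{w}) = \sum_{i=1}^{D} \left| \frac{x_i^{\alpha}}{\sum_{j=1}^{D} x_j^{\alpha}} - \frac{w_i^{\alpha}}{\sum_{j=1}^{D} w_j^{\alpha}} \right|. \tag{5}
\]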
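
A short Python sketch of the taxicab metric (4) and its power-transformed version (5) is given below. The function names are hypothetical; the example uses α > 0 so that the zero component needs no special treatment.

```python
def power_transform(x, alpha):
    """Power transformation (1), applied to one composition."""
    powered = [xi ** alpha for xi in x]
    total = sum(powered)
    return [p / total for p in powered]


def tc(x, w):
    """Taxicab (L1) metric (4)."""
    return sum(abs(xi - wi) for xi, wi in zip(x, w))


def tc_alpha(x, w, alpha):
    """TC_alpha metric (5): taxicab distance after the power transformation."""
    return tc(power_transform(x, alpha), power_transform(w, alpha))


x = [0.6, 0.4, 0.0]
w = [0.1, 0.3, 0.6]
print(tc(x, w))             # 0.5 + 0.1 + 0.6, i.e. 1.2 up to rounding
print(tc_alpha(x, w, 1.0))  # same value as tc(x, w) when alpha = 1
print(tc_alpha(x, w, 0.5))
```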


2.3 The Aitchisonian metric for compositional data

Aitchison (2003) suggested the Euclidean metric applied to the log-ratio transformed data
as a measure of distance between compositions

\[
\text{Ait}(\mathbf{x}, \mathbf{w}) = \left[ \sum_{i=1}^{D} \left( \log \frac{x_i}{g(\mathbf{x})} - \log \frac{w_i}{g(\mathbf{w})} \right)^2 \right]^{1/2}, \tag{6}
\]

where $g(\mathbf{z}) = \prod_{i=1}^{D} z_i^{1/D}$ stands for the geometric mean.
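
The Aitchisonian metric (6) can be sketched in Python as follows. The function name `aitchison` is hypothetical, and raising an error for non-positive components is a choice made here to make visible that the logarithm is unavailable when zeros are present.

```python
import math


def aitchison(x, w):
    """Aitchisonian metric (6): Euclidean distance between centred log-ratios."""
    if min(min(x), min(w)) <= 0:
        raise ValueError("log-ratios need strictly positive components")

    def centred_log_ratio(z):
        logs = [math.log(zi) for zi in z]
        mean_log = sum(logs) / len(logs)  # log of the geometric mean g(z)
        return [value - mean_log for value in logs]

    cx, cw = centred_log_ratio(x), centred_log_ratio(w)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(cx, cw)))


x = [0.6, 0.3, 0.1]
w = [0.2, 0.5, 0.3]
print(aitchison(x, w))
print(aitchison([60, 30, 10], w))  # same distance: proportions or percentages
```

Calling `aitchison([0.5, 0.5, 0.0], w)` raises the `ValueError`, which is the situation the other two metrics avoid.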


2.4 Some comments 


The power transformed compositional vectors still sum to 1 and thus the ES-OVα (3) is still
a metric. It becomes clear that when α = 1 we end up with the ES-OV metric (2). If on the
other hand α = 0, then the distance is zero, since the compositional vectors become equal
to the centre of the simplex. An advantage of the ES-OVα metric (3) over the Aitchisonian
metric (6) is that the first one is defined even when zero values are present. In this case
the Aitchisonian metric (6) becomes degenerate and thus cannot be used. We have to note
that we need to scale the data so that they sum to 1 in the case of the ES-OV metric, but
this is not a requirement of the taxicab metric.
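
The two special cases mentioned above (α = 1 and α = 0) can be checked numerically with a small Python sketch. The function names are hypothetical and the example vectors are strictly positive, so no convention for zeros raised to the power zero is needed.

```python
import math


def power_transform(x, alpha):
    """Power transformation (1), applied to one composition."""
    powered = [xi ** alpha for xi in x]
    total = sum(powered)
    return [p / total for p in powered]


def es_ov(x, w):
    """ES-OV metric (2), with the convention 0 * log 0 = 0."""
    def term(a, b):
        return 0.0 if a == 0 else a * math.log(2 * a / (a + b))

    total = sum(term(xi, wi) + term(wi, xi) for xi, wi in zip(x, w))
    return math.sqrt(max(total, 0.0))


def es_ov_alpha(x, w, alpha):
    """ES-OV_alpha metric (3): ES-OV applied to power-transformed vectors."""
    return es_ov(power_transform(x, alpha), power_transform(w, alpha))


x = [0.6, 0.3, 0.1]
w = [0.2, 0.5, 0.3]
print(es_ov_alpha(x, w, 1.0), es_ov(x, w))  # equal up to rounding
print(es_ov_alpha(x, w, 0.0))               # 0.0: both map to the centre
```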

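Alternative metrics could be used as well, such as


1. the Hellinger metric (Owen, 2001)

\[
H(\mathbf{x}, \mathbf{w}) = \frac{1}{\sqrt{2}} \left[ \sum_{i=1}^{D} \left( \sqrt{x_i} - \sqrt{w_i} \right)^2 \right]^{1/2}
\]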

2. or the angular metric, if we treat compositional data as directional data (for more
information about this approach see Stephens (1982) and Scealy and Welsh (2011, 2014))

\[
\text{Ang}(\mathbf{x}, \mathbf{w}) = \arccos \left( \sum_{i=1}^{D} \sqrt{x_i w_i} \right)
\]

Aitchison (1992) argued that a simplicial metric should satisfy certain properties. These
properties include


1. Scale invariance. The requirement here is that the measure used to define the distance
between two compositional data vectors should be scale invariant, in the sense that
it makes no difference whether the compositions are represented by proportions or
percentages.

2. Subcompositional dominance. To explain this we consider two compositional data vectors
and we select sub-vectors from each consisting of the same components. Subcompositional
dominance means that the distance between the sub-vectors is always less
than or equal to the distance between the original compositional vectors.

3. Perturbation invariance. The requirement here is that the distance between compositional
vectors x and w should be the same as the distance between x ⊕ p and w ⊕ p,
where the operator ⊕ means element-wise multiplication and then division by the
sum so that the resulting vectors belong to $S^d$, and p is any vector (not necessarily
compositional) with positive components.
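
To make the third property concrete, the Python sketch below perturbs two compositions by the same positive vector and compares distances before and after. The helper names and the example vectors are hypothetical. It relies on the known fact that the Aitchisonian metric is perturbation invariant, while for this particular choice of p the taxicab distance changes.

```python
import math


def perturb(x, p):
    """Perturbation: element-wise product, then division by the sum."""
    product = [xi * pi for xi, pi in zip(x, p)]
    total = sum(product)
    return [value / total for value in product]


def taxicab(x, w):
    """Taxicab (L1) distance between two compositions."""
    return sum(abs(xi - wi) for xi, wi in zip(x, w))


def aitchison(x, w):
    """Euclidean distance between centred log-ratio vectors."""
    def centred_log_ratio(z):
        logs = [math.log(zi) for zi in z]
        mean_log = sum(logs) / len(logs)
        return [value - mean_log for value in logs]

    cx, cw = centred_log_ratio(x), centred_log_ratio(w)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(cx, cw)))


x = [0.6, 0.3, 0.1]
w = [0.2, 0.5, 0.3]
p = [1.0, 2.0, 5.0]  # any vector with positive components

ait_before = aitchison(x, w)
ait_after = aitchison(perturb(x, p), perturb(w, p))
tc_before = taxicab(x, w)
tc_after = taxicab(perturb(x, p), perturb(w, p))
print(ait_before, ait_after)  # equal up to rounding
print(tc_before, tc_after)    # these differ for this choice of p
```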

Whether or not all of the above metrics satisfy these three properties should not be a problem.
Take for example subcompositional dominance. If someone has a compositional dataset,
there has to be a good reason why they would choose to discard some components and form a
sub-composition. And even if they do, all the metrics are still applicable.

The message this paper tries to convey is that if someone uses a well defined metric (or
even a dissimilarity measure) in order to perform classification, they should be fine with that.
When dealing with data lying in Euclidean space, one can use dissimilarity measures
as well to perform clustering or discrimination. The question of interest is how we can
discriminate the observed groups of points as adequately as possible.


2.5 Loci of points equidistant from the centre of the simplex 

Figure 1 shows the effect of the power transformation (1) on the data. As expected, the data
come closer to the barycentre of the triangle as α tends to zero. The data used and plotted
in Figure 1 are the Arctic lake data (Aitchison, 2003). Figures 2 and 3 show the plots of
loci of points of the ES-OVα metric (3) and of the TCα metric (5) for different values of α
and Figure 4 shows the contour plots of the Aitchisonian metric (6). In all cases, the plots
of loci of points refer to the distance from the barycentre of the simplex. The loci of points
seen in Figure 2 have similar shape regardless of the value of α. This is not true for the loci
in Figure 3, which change as the value of α changes.

Figure 1: Ternary plots of the Arctic lake data (Aitchison, 2003) for different values of α.
The data are transformed using (a) α = −1, (b) α = −0.5, (c) α = −0.1, (d)
α = 0.1, (e) α = 0.5 and (f) α = 1.

Figure 2: Loci of points equidistant from the centre of the simplex using the ES-OVα metric
(3). In all cases the distances are from the barycentre of the simplex (1/3, 1/3, 1/3). The
contours are calculated using (a) α = −1, (b) α = −0.5, (c) α = −0.1, (d) α = 0.1, (e)
α = 0.5 and (f) α = 1.

Figure 3: Loci of points equidistant from the centre of the simplex using the TCα metric (5).
In all cases the distances are from the barycentre of the simplex (1/3, 1/3, 1/3). The contours
are calculated using (a) α = −1, (b) α = −0.5, (c) α = −0.1, (d) α = 0.1, (e) α = 0.5 and
(f) α = 1.

Figure 4: Loci of points equidistant from the centre of the simplex using the Aitchisonian
metric (6).


3 Supervised classification for compositional data using the k-NN algorithm

The goal of this paper is to perform supervised classification of compositional data using
the k-NN algorithm. For this reason we will use the ES-OVα (3) and TCα (5) metrics and
compare their performance and suitability with the Aitchisonian metric (6).

The k-NN algorithm is a non-parametric supervised learning technique which is computationally
heavier than quadratic and linear discriminant analysis but easier to implement, as
it relies solely on metrics between points.

Similarly to other supervised classification techniques, it requires some parameter tuning.
The two parameters associated with it in our case are the power parameter α and the number
of nearest neighbours k. We describe the steps of the k-NN algorithm for compositional data in our
case.


1. Separate the data into the training and the test dataset. 

2. Choose a value of k, the number of nearest neighbours.

3. Classify the test data using either the ES-OVα (3) or the TCα (5) metric for a range of values
of α and each time calculate the percentage of correct classification.

4. Repeat steps 2 and 3 for a different value of k.

5. Repeat steps 1 to 4 B times (in our case B = 200) and, for each α and k, estimate
the percentage of correct classification by averaging over all B repetitions.
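
The steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the implementation used for the results reported here: all function names, the tie-breaking rule, the per-class test fraction and the synthetic data are hypothetical, and only the TCα metric is shown.

```python
import random
from collections import Counter, defaultdict


def power_transform(x, alpha):
    """Power transformation (1), applied to one composition."""
    powered = [xi ** alpha for xi in x]
    total = sum(powered)
    return [p / total for p in powered]


def taxicab(x, w):
    """Taxicab metric (4)."""
    return sum(abs(xi - wi) for xi, wi in zip(x, w))


def knn_predict(train_x, train_y, z, k, dist):
    """Majority vote among the k training points closest to z."""
    nearest = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], z))[:k]
    votes = Counter(train_y[i] for i in nearest)
    best = max(votes.values())
    # Assumption: ties go to the tied label whose neighbour is closest.
    return next(train_y[i] for i in nearest if votes[train_y[i]] == best)


def stratified_split(y, test_fraction, rng):
    """Step 1: hold out a share of every class as test data."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    test = []
    for members in by_class.values():
        rng.shuffle(members)
        test.extend(members[:max(1, round(test_fraction * len(members)))])
    held_out = set(test)
    train = [i for i in range(len(y)) if i not in held_out]
    return train, test


def estimate_accuracy(X, y, alphas, ks, B=200, test_fraction=0.1, seed=0):
    """Average proportion of correct classification for each (alpha, k)."""
    rng = random.Random(seed)
    totals = defaultdict(float)
    for _ in range(B):                                  # step 5
        train, test = stratified_split(y, test_fraction, rng)
        for alpha in alphas:                            # step 3
            # TC_alpha (5) is the taxicab metric on power-transformed data.
            U = [power_transform(x, alpha) for x in X]
            train_x = [U[i] for i in train]
            train_y = [y[i] for i in train]
            for k in ks:                                # steps 2 and 4
                hits = sum(
                    knn_predict(train_x, train_y, U[i], k, taxicab) == y[i]
                    for i in test
                )
                totals[(alpha, k)] += hits / len(test)
    return {pair: total / B for pair, total in totals.items()}


# Synthetic two-class compositions, purely for illustration.
rng = random.Random(1)


def noisy(centre):
    raw = [c + rng.uniform(0.0, 0.05) for c in centre]
    return [r / sum(raw) for r in raw]


X = [noisy([0.8, 0.1, 0.1]) for _ in range(20)]
X += [noisy([0.1, 0.1, 0.8]) for _ in range(20)]
y = ["A"] * 20 + ["B"] * 20
accuracy = estimate_accuracy(X, y, alphas=[0.5, 1.0], ks=[1, 3], B=5)
best_alpha, best_k = max(accuracy, key=accuracy.get)
```

Swapping `taxicab` for an ES-OV distance function gives the corresponding procedure for the ES-OVα metric, and the pair (α, k) with the highest averaged accuracy is the one retained.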


We can of course use the Aitchisonian metric (6) instead of the ES-OVα (3) or the TCα
metric (5). In this case we have to choose the number of nearest neighbours only, since no
power transformation is involved. We could of course use any other metric defined in $R^d$. In
this case we would have to apply the additive log-ratio transformation (Aitchison, 2003) to
the data. The issue in that case though would be the presence of zeros in the data.

In the next section we will see two examples using real data and see the performance of
the algorithm when each of the metrics is used.


3.1 Examples using real data 

We will now see the performance of the k-NN algorithm using the ES-OVα metric (3), the
TCα metric (5) and the Aitchisonian metric (6) with real data.


Example 1. Hydrochemical data with no zero values 


The first dataset comes from hydrochemistry. The hydrochemical data set (Otero et al., 2005)
contains measurements on 14 elements. The data were gathered within a period of 2 years
from 31 stations located along the rivers and main tributaries of the Llobregat river, one of
the medium rivers in northeastern Spain. Each of these elements is measured approximately
once each month during these 2 years. There are 4 tributaries of interest: Anoia (143 measurements),
Cardener (95 measurements), Upper Llobregat (135 measurements) and Lower
Llobregat (112 measurements). Thus, there are 485 observations across all 4 tributaries.

This dataset contains no zero values, so all three metrics are applicable. The size of the
training sample was equal to 434 and thus the test sample consisted of 51 observations, which
were sampled using stratified random sampling each time to ensure that observations from all
tributaries are selected every time. Figure 5 shows the heat plot of the estimated percentage
of correct classification for different values of k and α.

If α = 0.5 and k = 2 the estimated percentage of correct classification is equal
to 92.78%, and when α = 1 and k = 3 the estimated percentage is 89.88%, when the ES-OVα
metric (3) was applied. When the TCα metric (5) is applied the results are similar: with
α = 0.35 and k = 2 the estimated percentage of correct classification is 93.77%, and when
α = 1 and k = 2 the estimated percentage of correct classification is 86.55%. This is an
example where a value of α other than 1 leads to better results. The change in the percentage
might seem small, but if we take into account the total sample size, we will see that 3%
of 485 observations is about 14 observations, which is not a small number. The Aitchisonian metric
on the other hand did not do that well. The maximum estimated percentage was equal to
85.46% when k = 2.

Figure 5: The estimated percentage of correct classification for the hydrochemical data as a
function of k, the number of nearest neighbours, and of α using the (a) ES-OVα metric (3) and
(b) TCα metric (5).

Figure 6: The estimated percentage of correct classification as a function of k (number of
nearest neighbours). The black and the red lines are based on the ES-OVα metric (3) with
α = 0.5 and α = 1 respectively. The green and the blue lines are based on the TCα metric (5)
with α = 0.35 and α = 1 respectively. The turquoise line is the Aitchisonian metric (6).

More information (including the specificities and sensitivities for each tributary averaged
over all 200 replications) regarding the classification results is presented in Table 1 below. A
general conclusion about the mean sensitivities and specificities is that the lower sensitivities
are observed when the estimated percentage of correct classification is lower and they also have
larger standard errors. The mean specificities on the other hand are in general high and
are less affected by the estimated percentage of correct classification.


ES-OVα metric

Tuning parameters   Percentage of            Tributaries       Sensitivities      Specificities
                    correct classification
α = 0.5 & k = 2     92.78% (3.25%)           Anoia             95.77% (4.92%)     98.60% (2.14%)
                                             Cardener          85.25% (10.65%)    97.06% (2.32%)
                                             Upper Llobregat   93.93% (6.07%)     97.58% (2.34%)
                                             Lower Llobregat   94.00% (6.48%)     97.24% (2.37%)
α = 1 & k = 3       89.88% (3.96%)           Anoia             93.57% (5.92%)     97.17% (2.83%)
                                             Cardener          82.10% (12.82%)    96.13% (2.83%)
                                             Upper Llobregat   91.50% (7.42%)     96.42% (2.84%)
                                             Lower Llobregat   89.88% (8.39%)     96.85% (2.63%)

TCα metric

Tuning parameters   Percentage of            Tributaries       Sensitivities      Specificities
                    correct classification
α = 0.35 & k = 2    93.77% (3.13%)           Anoia             96.73% (4.58%)     98.60% (1.83%)
                                             Cardener          87.80% (10.28%)    97.66% (2.24%)
                                             Upper Llobregat   94.18% (5.86%)     97.85% (2.30%)
                                             Lower Llobregat   94.58% (6.18%)     97.65% (2.24%)
α = 1 & k = 2       86.55% (4.71%)           Anoia             90.03% (7.41%)     95.99% (3.52%)
                                             Cardener          79.70% (13.45%)    96.56% (2.66%)
                                             Upper Llobregat   85.54% (8.84%)     95.95% (3.12%)
                                             Lower Llobregat   89.08% (9.47%)     93.58% (3.69%)

Aitchisonian metric

                    Percentage of            Tributaries       Sensitivities      Specificities
                    correct classification
                    85.46% (5.07%)           Anoia             87.40% (8.63%)     96.25% (2.94%)
                                             Cardener          77.65% (12.68%)    95.91% (2.86%)
                                             Upper Llobregat   89.89% (7.69%)     93.95% (3.75%)
                                             Lower Llobregat   84.38% (9.88%)     94.49% (3.72%)


Table 1: Classification results for the hydrochemical data. The number inside the parentheses
indicates the standard error of the percentages.
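
The text does not spell out how the per-category sensitivities and specificities are computed. A common one-vs-rest reading is assumed in the Python sketch below, where each category in turn is treated as the positive class. The function name and the toy labels are hypothetical and are not taken from the data.

```python
def sensitivity_specificity(true_labels, predicted_labels, category):
    """One-vs-rest sensitivity and specificity for a single category."""
    tp = fn = tn = fp = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == category:
            if guess == category:
                tp += 1  # member of the category, recognised as such
            else:
                fn += 1  # member of the category, assigned elsewhere
        elif guess == category:
            fp += 1      # non-member wrongly assigned to the category
        else:
            tn += 1      # non-member correctly kept out of the category
    # Assumes the category and at least one other category occur in
    # true_labels; otherwise a denominator would be zero.
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels for illustration only.
truth = ["Anoia", "Anoia", "Cardener", "Cardener", "Upper", "Lower"]
guess = ["Anoia", "Cardener", "Cardener", "Cardener", "Upper", "Anoia"]
print(sensitivity_specificity(truth, guess, "Anoia"))     # (0.5, 0.75)
print(sensitivity_specificity(truth, guess, "Cardener"))  # (1.0, 0.75)
```

Averaging these quantities over the B repetitions, and taking their standard deviation, would give figures of the kind reported in the table.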


In addition we calculated the ROC curves for each of the three metrics. In order to do this
we performed leave-one-out cross validation. That is, we removed an observation and then, using
the parameters α and k which are given in Table 1 (since they produced the best results),
we classified it. This procedure was repeated for all observations. Thus, we ended up with
the predicted membership values for all observations based on the 3 metrics. This allowed
us to draw the ROC curves for each tributary when all 3 metrics were used. The results are
presented in Figure 7.

We can see that for all tributaries the ROC curves of the ES-OVα metric (3) and the TCα
metric (5) are similar, whereas the ROC curve of the Aitchisonian metric (6) is always the
lowest.

Figure 7: ROC curves for all tributaries using the three metrics. For ES-OVα we used α = 0.5
and k = 2 and for TCα we used α = 0.35 and k = 2. Each plot corresponds to one of the
four tributaries: (a) Anoia, (b) Cardener, (c) Upper Llobregat and (d) Lower Llobregat.


Example 2. Forensic glass data with zero values 

In the second example we will use the forensic glass dataset, which has 214 observations from
6 different categories of glass with 8 chemical elements, in percentage form. The categories
which occur are containers (13 observations), vehicle headlamps (29 observations), tableware
(9 observations), vehicle window glass (17 observations), window float glass (70 observations)
and window non-float glass (76 observations). This dataset contains a large number of zeros
as well, thus excluding ERA from being applied here. The data are available from the
UC Irvine Machine Learning Repository.

An interesting feature of this dataset is that it contains many zero values. This means
that the Aitchisonian metric (6) is not to be used. The ES-OVα and the TCα metrics on
the other hand are not affected by the presence of zeros, since 0 log 0 = 0. In this example
the sample size of the test data was equal to 30, hence we used 184 compositional vectors to
train the k-NN algorithm. Again, the test data were chosen via stratified random sampling
to ensure that every category is represented in the test sample. Figure 8 shows the estimated
percentage of correct classification as a function of k and α using both metrics.

Figure 8: The estimated percentage of correct classification for the forensic glass data as a
function of k, the number of nearest neighbours, and of α using the (a) ES-OVα metric (3) and
(b) TCα metric (5).


This is a simpler case from which to draw conclusions, since the best results are obtained when α = 1
and k = 2 for both metrics. Thus the ES-OV (2) and the TC (4) metrics should be used, with
the estimated percentage of correct classification being 71.45% and 73.35% respectively. Table
2 presents detailed information on the classification results. Estimates of the sensitivities
and of the specificities for each category of glass are also given.

The mean sensitivities of the ES-OVα metric (3) for Tableware and Vehicle window are low
and the same is true for Vehicle window when TCα (5) is used. We observed that,
many times, Tableware and Vehicle window were wrongly classified as Vehicle float. A
possible reason for this could be the small sample size of Tableware (this type of glass had
the minimum number of observations). A chemist or a forensic scientist could perhaps say
whether these types of glass are of similar structure, which would explain the misclassification.

The ROC curves for each glass category (based on leave-one-out cross validation) using both
metrics are presented in Figure 9. We cannot say that one metric always does better than the
other. For some glass categories the two ROC curves are similar, and for some others one
seems a bit better than the other.


ES-OVα

Tuning parameters   Percentage of            Glass categories    Sensitivities      Specificities
                    correct classification
α = 1 & k = 3       71.45% (7.76%)           Containers          77.25% (29.57%)    97.46% (2.76%)
                                             Vehicle headlamps   80.88% (16.99%)    96.44% (3.60%)
                                             Tableware           36.50% (48.26%)    96.95% (3.00%)
                                             Vehicle window      29.25% (31.81%)    97.50% (2.97%)
                                             Vehicle float       81.65% (11.60%)    82.30% (7.28%)
                                             Non-window float    68.55% (13.39%)    90.50% (6.30%)

TCα

Tuning parameters   Percentage of            Glass categories    Sensitivities      Specificities
                    correct classification
α = 1 & k = 3       73.35% (8.00%)           Containers          77.75% (30.37%)    98.18% (2.43%)
                                             Vehicle headlamps   82.62% (16.66%)    99.15% (1.77%)
                                             Tableware           74.50% (43.70%)    98.14% (2.70%)
                                             Vehicle window      29.75% (31.74%)    95.48% (3.71%)
                                             Vehicle float       77.90% (12.58%)    82.10% (7.52%)
                                             Non-window float    72.86% (14.45%)    90.11% (6.99%)


Table 2: Classification results for the forensic glass data. The number inside the parentheses
indicates the standard error of the percentages.

Figure 9: ROC curves for all glass categories using the two metrics. In all cases α = 1 and
k = 3 were used for both metrics. Each plot corresponds to one of the six glass categories: (a)
containers, (b) vehicle headlamps, (c) tableware, (d) vehicle window glass, (e) window float
glass and (f) window non-float glass.


4 Conclusions

We suggested the use of a recently developed metric (2) for supervised classification when
the k-NN algorithm is implemented. We also added a free parameter to the metric with the
intention of improving the classification results. This free parameter was used to generalize
the taxicab metric as well. The examples showed that both the ES-OVα (3) and the taxicabα
(5) metrics can be used for supervised classification of compositional data, but they can also be
used in other scenarios.

An advantage of both metrics over the Aitchisonian metric (6) is that they handle zeros
naturally. This implies that no zero value replacement is necessary, either parametrically
(Martin et al., 2012) or non-parametrically (Aitchison, 2003). In order to appreciate the
importance of this advantage one can think of large datasets with many zeros.

The two metrics outperformed the Aitchisonian metric (6) in the examples presented in
this manuscript. When it comes to comparing the ES-OVα (3) and the taxicabα (5)
metrics with each other, we cannot say one is better than the other.

A closer examination of the ROC curves revealed valuable information, especially for the
forensic glass data example (where zeros are present), regarding the classification abilities of
the ES-OVα (3) and the taxicabα (5) metrics. The sensitivities and specificities revealed interesting
patterns of the misclassification rates not captured by the percentage of correct classification.
In addition, the ROC curves provided graphical evidence of the ability of each metric to
classify the observations.